1 Introduction

Water kefir is a fizzy and mildly acidic beverage obtained through the process of fermenting a mixture of water, sugar, fruits, and water kefir grains. This particular drink has become more well-known as a result of its alleged probiotic qualities, which are thought to be advantageous to health. Additionally, water kefir is considered a potential vegan alternative to milk kefir. In order to gain a better understanding of the biological aspects and the mechanisms involved in water kefir production, several studies have aimed to identify the microbial species present in water kefir grains. It has been observed that the microbial communities in water kefir grains exhibit variations depending on factors such as the geographical origin of the grains and the specific ingredients used in the fermentation process. However, the majority of these communities consist of yeast, lactic acid bacteria, and acetic acid bacteria. Lactic and acetic acid bacteria are widely distributed groups of bacteria that possess the ability to metabolize sugars into acids. Consequently, these bacteria are commonly found in various types of fermented foods. Previous researches employing culture-based techniques have provided evidence for the presence of specific species, namely Lactobacillus harbinensis, Lactobacillus hilgardii, Lactobacillus paracasei, and Bifidobacterium aquikefiri, within various water kefir cultures. These particular species have demonstrated a high degree of conservation across different kefir cultures, as indicated by previous studies. However, in-depth exploration of the diversity and composition of water kefir has been limited, with only a few studies adopting metagenomic approaches such as shotgun metagenomics. The utilization of this DNA sequencing methodology enables comprehensive genomic analyses of all microorganisms present in a given sample, facilitating a precise determination of microbial species diversity within the water kefir community. Shotgun metagenomic sequencing is a powerful approach in microbial community analysis that integrates high-throughput sequencing technologies with computational pipelines. By sequencing the entire microbial communities present in multiple samples, we were able to investigate the stability and diversity of kefir microbiomes derived from two distinct starter cultures and fermented with either figs or raisins. Our study aimed to examine the variations between the initial microbial communities, their temporal composition changes, and how they were influenced by the surrounding environment.

2 Methods

Our experimental design involved cultivating kefir grains from two distinct starter cultures, namely Philipp starter and Vincent starter, using either raisins or figs as the fermentation substrate. To achieve a comprehensive genetic and metagenetic characterization of the kefir grains, we employed multiple methods in this study. Initially, we conducted whole genome sequencing on cultivable isolates derived from the starter culture kefir grains. This analysis utilized two complementary technologies: Illumina for generating short reads and Oxford Nanopore for producing long reads. These sequencing techniques were selected to ensure the generation of high-quality genome assemblies for our subsequent analyses. To assess the taxonomic diversity of our samples, we performed 16S amplicon sequencing, which was analyzed using the DADA2 workflow. Prior to analysis, we performed trimming and quality control steps using trimmomatic and cutadapt to remove the PCR primers utilized during the amplification process. Subsequently, fastQC was employed to verify the quality of the raw data obtained from the sequencing. The read counts were recorded both before and after trimming to ensure an adequate number of reads remained for downstream analysis.

The entire DADA2 analysis was conducted using R and the DADA2 package. The analysis began with the generation of quality profile plots. Subsequently, another trimming step was performed to retain only high-quality base pairs while ensuring sufficient overlap for Amplicon Sequence Variant (ASV) bridging in later stages. After trimming, the remaining reads were utilized to train an error rate model. Error plots were generated to verify that substitution errors fell within the expected range. Using the error rate model, errors were corrected, allowing for the inference of ASV variants. The corrected ASVs obtained from the previous step were merged to obtain complete paired-end ASV variants with an approximate length of 255 base pairs. These sequences were then used to construct a table of ASVs ranging from 250 to 256 base pairs. Following the removal of potential chimeric ASVs resulting from PCR artifacts, the taxonomic classification of ASVs was performed. The classified ASVs were saved and exported as a Phyloseq object, which was utilized to generate bar plots illustrating sample diversity. Subsequently, we proceeded with the further characterization of our samples using shotgun metagenomic sequencing. Similar to the 16S Amplicon data, initial steps of trimming and quality control were performed.

3 16s rRNA amplicon sequencing (DADA2 pipeline)

3.1 Setup

Why did we use the -g option ?
> We used the “-g” option to tell the algorithm to look for adapters that precede (5’) the sequence of interest and trim them.

Why is there ‘strange’ nucleotides in the primers (e.g. H, W, M..).?
> These symbols are IUPAC wildcards chracters and can refer to multiple bases, in the case of a damaged DNAs. For example, W means an A or a T.

3.2 Get files

# Forward and reverse fastq filenames have format: SAMPLENAME_1_paired_cut.fq.gz and SAMPLENAME_2_paired_cut.fq.gz
# PathReads contains the path to the files
FWDfiles <- sort(list.files(PathReads, pattern = "_1_paired_cut.fq.gz", full.names = TRUE))
REVfiles <- sort(list.files(PathReads, pattern = "_2_paired_cut.fq.gz", full.names = TRUE))

# Extract sample names, assuming filenames have format: SAMPLENAME_*.fq*  
sample.names <- sapply(strsplit(basename(FWDfiles), "_"), `[`, 1)
sample.names
##  [1] "K1"  "K10" "K11" "K12" "K13" "K14" "K15" "K16" "K17" "K18" "K19" "K2" 
## [13] "K20" "K21" "K22" "K23" "K24" "K25" "K26" "K27" "K28" "K29" "K3"  "K30"
## [25] "K31" "K32" "K33" "K34" "K4"  "K5"  "K6"  "K7"  "K8"  "K9"

Explain how this command line works: sapply(strsplit(basename(FWDfiles), "_"),[, 1)
> This command line is used to extract the sample name of each sample from the corresponding filename (K..). The basename function extracts the name for each file of the forward files, without the directory path. The strsplit function is used to split each filename obtained into multiple parts using the “_” as delimitor. The sapply function applies the operator “[” and passes the argument “1” to this operator, resulting in “[1]”, meaning we take the first element of the vector (the K..).

3.3 Quality scores

# Quality scores of R1 reads
plotQualityProfile(FWDfiles[5:8]) 
## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.

# Quality scores of R2 reads
plotQualityProfile(REVfiles[5:8])
## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.

# Quality scores for aggregated files
plotQualityProfile(FWDfiles[1:18], n = 1e+06, aggregate = TRUE) 
## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.

plotQualityProfile(REVfiles[1:18], n = 1e+06, aggregate = TRUE)
## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.

What do you think of these quality profiles?
> Regarding our reads, the quality is good at the beginning but decreases at the end. Here the green line represents the median quality score and the red line represents the proportion of reads that reached a certain length. Looking at the aggregated file, we have a good proportion of reads that have a sufficient quality score and also many reads reaching a sufficient length. We will however need to trim the data before the analysis.

Explain what is the option n = 1e+06?
> The option of n = 1e+06 is used to specify the number of reads to sample.

3.4 Trim the data

# Place filtered files in filtered/ subdirectory

PathTrim <- paste(PathToWD, "02_16SrRNA_data_trimmed_cut_trimmed", sep="/")

filtFWD <- file.path(PathTrim, paste0(sample.names, "_F_paired_cut_trimmed.fq.gz"))
filtREV <- file.path(PathTrim, paste0(sample.names, "_R_paired_cut_trimmed.fq.gz"))

names(filtFWD) <- sample.names
names(filtREV) <- sample.names

out <- filterAndTrim(FWDfiles, filtFWD, 
                     REVfiles, filtREV, 
                     truncLen=c(155,125), # There is around 84bp overlap
                     maxN=0, 
                     maxEE=c(2,2), 
                     truncQ=9, 
                     rm.phix=TRUE,
                     compress=TRUE, multithread=FALSE) 

# All parameters but truncLen are default DADA2 params

# Check this all makes sense
plotQualityProfile(filtFWD[1:18], n = 1e+06, aggregate=TRUE) 

plotQualityProfile(filtREV[1:18], n = 1e+06, aggregate=TRUE) 

Figure 1. Summary of the quality profiles of all our samples, as well as the aggregated.

Can you see a difference between these profiles and the ones before trimming?
> Compared to the quality profiles before trimming, we can see that the quality scores increased. We retained the majority of the reads with more than 976’000 reads remaining.

3.5 Error rates

errF <- learnErrors(filtFWD, randomize = TRUE, nbases = 3e8, multithread = TRUE) 
## 300455255 total bases in 1938421 reads from 32 samples will be used for learning the error rates.
errR <- learnErrors(filtREV, randomize = TRUE, nbases = 3e8, multithread = TRUE) 
## 258476750 total bases in 2067814 reads from 34 samples will be used for learning the error rates.
plotErrors(errF, nominalQ = TRUE)

plotErrors(errR, nominalQ = TRUE)

Figure 2. Log of the error frequency, as a function of the consensus quality score.

3.6 Dereplication

derepFWD <- derepFastq(filtFWD)
derepREV <- derepFastq(filtREV)

sample.names <- sapply(strsplit(basename(filtFWD),"_F_"),`[`,1)

names(derepFWD) <- sample.names
names(derepREV) <- sample.names

3.7 Sample inference

dadaFs <- dada(derepFWD, err = errF, multithread = TRUE)
dadaRs <- dada(derepREV, err = errR, multithread = TRUE)
dadaFs$K2
## dada-class: object describing DADA2 denoising results
## 275 sequence variants were inferred from 5130 input unique sequences.
## Key parameters: OMEGA_A = 1e-40, OMEGA_C = 1e-40, BAND_SIZE = 16
dadaRs$K2
## dada-class: object describing DADA2 denoising results
## 222 sequence variants were inferred from 3405 input unique sequences.
## Key parameters: OMEGA_A = 1e-40, OMEGA_C = 1e-40, BAND_SIZE = 16

What kind of objects are dadaFs and dadaRs?
> They are dada-class objects generated by denoising the forward (dadaF) or reverse (dadaR) reads. They refer to the outputs generated by the DADA2 algorithm during the sequence processing steps. These objects contain the results of denoising and error correction performed on the raw sequence data, precisely information such as the sequence variants, their abundances, and other associated metadata.

How many sequence variants (ASVs) are inferred?
> In the forward, 275 sequence variants were inferred from 5130 input unique sequences. In the reverse, 222 sequence variants were inferred from 3405 input unique sequences.

3.8 Merging reads

mergers <- mergePairs(dadaFs, derepFWD, dadaRs, derepREV, verbose = TRUE, trimOverhang = TRUE)
# Checking the dataframe
head(mergers[[1]])
##                                                                                                                                                                                                                                                        sequence
## 1 TACGTAGGTGGCAAGCGTTGTCCGGATTTATTGGGCGTAAAGGGAACGCAGGCGGTCTTTTAAGTCTGATGTGAAAGCCTTCGGCTTAACCGAAGTCGTGCATTGGAAACTGGGAGACTTGAGTGCAGAAGAGGAGAGTGGAACTCCATGTGTAGCGGTGAAATGCGTAGATATATGGAAGAACACCAGTGGCGAAAGCGGCTCTCTGGTCTGTAACTGACGCTGAGGTTCGAAAGCGTGGGTAGCAAACAGG
## 2 TACGTAGGTGGCAAGCGTTGTCCGGATTTATTGGGCGTAAAGCGAGCGCAGGCGGTTTTTTAGGTCTGATGTGAAAGCCTTCGGCTTAACCGGAGAAGTGCATCGGAAACCGGGAGACTTGAGTGCAGAAGAGGACAGTGGAACTCCATGTGTAGCGGTGAAATGCGTAGATATATGGAAGAACACCAGTGGCGAAGGCGGCTGTCTGGTCTGCAACTGACGCTGAGGCTCGAAAGCATGGGTAGCGAACAGG
## 3 TACGTAGGTGGCAAGCGTTATCCGGATTTATTGGGCGTAAAGCGAGCGCAGGCGGTTTTTTAAGTCTGATGTGAAAGCCCTCGGCTTAACCGAGGAAGCGCATCGGAAACTGGGAAACTTGAGTGCAGAAGAGGACAGTGGAACTCCATGTGTAGCGGTGAAATGCGTAGATATATGGAAGAACACCAGTGGCGAAGGCGGCTGTCTGGTCTGTAACTGACGCTGAGGCTCGAAAGCATGGGTAGCGAACAGG
## 4 TACGTAGGTGGCAAGCGTTGTCCGGATTTATTGGGCGTAAAGGGAACGCAGGCGGTTTTTTAAGTCTGATGTGAAAGCCTTCGGCTTAACCGAAGTCGTGCATTGGAAACTGGAGAACTTGAGTGCAGAAGAGGAGAGTGGAACTCCATGTGTAGCGGTGAAATGCGTAGATATATGGAAGAACACCAGTGGCGAAAGCGGCTCTCTGGTCTGTAACTGACGCTGAGGTTCGAAAGCGTGGGTAGCAAACAGG
## 5 TACGTAGGTGGCAAGCGTTATCCGGAATTATTGGGCGTAAAGCGCGCGTAGGCGGTTTTTTAAGTCTGATGTGAAAGCCCACGGCTCAACCGTGGAGGGTCATTGGAAACTGGAAAACTTGAGTGCAGAAGAGGAAAGTGGAATTCCATGTGTAGCGGTGAAATGCGCAGAGATATGGAGGAACACCAGTGGCGAAGGCGACTTTCTGGTCTGTAACTGACGCTGATGTGCGAAAGCGTGGGGATCAAACAGG
## 6 TACGTAGGTGGCAAGCGTTATCCGGATTTATTGGGCGTAAAGCGAGCGCAGGCGGTTTTTTAAGTCTGATGTGAAAGCCCTCGGCTTAACCGAGGAAGCGCATCGGAAACTGGAAAACTTGAGTGCAGAAGAGGACAGTGGAACTCCATGTGTAGCGGTGAAATGCGTAGATATATGGAAGAACACCAGTGGCGAAGGCGGCTGTCTGGTCTGTAACTGACGCTGAGGCTCGAAAGCATGGGTAGCGAACAGG
##   abundance forward reverse nmatch nmismatch nindel prefer accept
## 1     24231       1       1     27         0      0      2   TRUE
## 2     19093       2       2     27         0      0      2   TRUE
## 3      3937       4       3     27         0      0      2   TRUE
## 4      3338       3       1     27         0      0      2   TRUE
## 5      2852       6       4     27         0      0      2   TRUE
## 6      2805       5       3     27         0      0      2   TRUE

What information can you find in the merger object?
> The mergers object is a dataframe summarizing the stats for each sequence, such as the abundance, the number of matches and mismatches, etc. Each line is a sequence, and we have information about both the forward and reverse.

3.9 Sequence table

seqtab <- makeSequenceTable(mergers)

# Check distribution
table(nchar(getSequences(seqtab)))
## 
##  101  107  137  157  163  168  170  171  178  181  203  221  223  224  225  227 
##    1    1    1    1    1    1    2    1    1    1    1   11    3    2    1    5 
##  228  229  231  233  234  243  244  250  251  252  253  254  255  256  257  258 
##    1    1    1    1    1    1    3    6   10  379 6064  278   12    3    1    2 
##  260  263  265  267  268 
##    1    1    1    1    1
nrow(seqtab)
## [1] 34

Why do we expect ASVs within the above mentioned length range?
> We expect ASVs to be in this range, because we used Illumina sequencing and it is the usual length of the sequences. After merging the Fwd and Rev reads, we expect sequences in the range of [250:256] bp. However, some sequences may be shorter or longer than expected due to wrong merging.

What are the dimension of the seqtab?
> The seqtab has 34 rows (which are the sequences) and 6795 columns (which are the samples).

What can you say about the distribution of the sequence lengths?
> As said above, the length is mainly between 252 and 254, with the most common length being 253. This is because it is the size of our Illumina sequences.

Explain this command line table(nchar(getSequences(seqtab)))
> getSequences extracts the unique sequences from our seqtab object. nchar counts the number of nucleotides for each sequence. The function table then creates a frequency table from the vector of sequence lengths.

3.10 Remove sequences with length too distant

seqtab2 <- seqtab[,nchar(colnames(seqtab)) %in% 252:254]

# Check distribution
table(nchar(getSequences(seqtab2)))
## 
##  252  253  254 
##  379 6064  278
# Dimensions
dim(seqtab2)
## [1]   34 6721
# Difference with seqtab
ncol(seqtab)-ncol(seqtab2) # 82 ASVs have been removed
## [1] 82

What are the dimensions of seqtab2?
> We have still 34 sequences but 6713 samples.

How many ASVs have been removed?
> 82 ASVs have been removed (6795-6713).

3.11 Remove chimeras

seqtab.nochim <- removeBimeraDenovo(seqtab2, method = "consensus", multithread = TRUE, verbose = TRUE)
## Identified 1649 bimeras out of 6721 input sequences.
(1 - (ncol(seqtab.nochim)/ncol(seqtab2))) * 100
## [1] 24.53504
ncol(seqtab.nochim)
## [1] 5072

What is the percentage of chimeric sequences?
> To get the percentage of chimeric sequences, we can use the number of columns of our files, since every sequence is stored in a different column. We have 24.5% of chimeric sequences.

How many ASVs do you have in the end?
> We have 5067 ASVs in the end.

3.12 Assign taxonomy

taxa <- assignTaxonomy(seqtab.nochim, "/mnt/raidarray/home/SHARED/2023_SAGE2/SILVA/silva_nr99_v138_train_set.fa.gz", multithread=FALSE)
taxa2 <- addSpecies(taxa, "/mnt/raidarray/home/SHARED/2023_SAGE2/SILVA/silva_species_assignment_v138.fa.gz")

# Creating table for display
taxa.print <- taxa
rownames(taxa.print) <- NULL
head(taxa.print)
##      Kingdom    Phylum             Class            Order              
## [1,] "Bacteria" "Firmicutes"       "Bacilli"        "Lactobacillales"  
## [2,] "Bacteria" "Firmicutes"       "Bacilli"        "Lactobacillales"  
## [3,] "Bacteria" "Firmicutes"       "Bacilli"        "Lactobacillales"  
## [4,] "Bacteria" "Firmicutes"       "Bacilli"        "Lactobacillales"  
## [5,] "Bacteria" "Firmicutes"       "Bacilli"        "Lactobacillales"  
## [6,] "Bacteria" "Actinobacteriota" "Actinobacteria" "Bifidobacteriales"
##      Family               Genus          
## [1,] "Lactobacillaceae"   "Lactobacillus"
## [2,] "Lactobacillaceae"   "Lactobacillus"
## [3,] "Lactobacillaceae"   "Lactobacillus"
## [4,] "Lactobacillaceae"   "Lactobacillus"
## [5,] "Lactobacillaceae"   "Lactobacillus"
## [6,] "Bifidobacteriaceae" NA
PathTax <- paste(PathToWD, "03_Taxonomy", sep = "/")
dir.create(PathTax)
## Warning in dir.create(PathTax): '/mnt/raidarray/home/etu19/DADA2/03_Taxonomy'
## existe déjà
# Save the ASV sequences and the Taxonomy table
write.csv2(file = paste(PathTax, "Taxtable_dada2.csv", sep = "/"), taxa2)
write.csv2(file = paste(PathTax, "ASV_sequences.csv", sep = "/"), seqtab.nochim)

3.13 PhyloSeq object

# Open the data frame containing sample information
metaData <- read.csv("/mnt/raidarray/home/SHARED/2023_SAGE2/Metadata_kefir.csv", fill = T)
rownames(metaData) = metaData$id_sample

#Create a phyloseq object
ps_raw <- phyloseq(otu_table(seqtab.nochim, taxa_are_rows=F), 
               sample_data(metaData), 
               tax_table(taxa2))

meta(ps_raw) # This command allows to check the metadata of a PhyloSeq object
##     id_sample  treatment DNAconc inoculum   fruit  time    label
## K1         K1 2_vin_rais   77.77  vincent raisins    T2      s10
## K10       K10 2_vin_rais   19.15  vincent raisins    T2      s26
## K11       K11 1_vin_figs   60.69  vincent    figs    T1      s17
## K12       K12 4_phi_rais  246.36  philipp raisins    T4      s20
## K13       K13 1_vin_figs    8.38  vincent    figs    T1      s46
## K14       K14 2_vin_rais   72.45  vincent raisins    T2      s22
## K15       K15 3_phi_figs   72.17  philipp    figs    T3      s29
## K16       K16 4_phi_rais   76.66  philipp raisins    T4      s28
## K17       K17    0_start  130.74    start    figs START philipp2
## K18       K18 2_vin_rais  159.86  vincent raisins    T2       s6
## K19       K19 1_vin_figs   30.34  vincent    figs    T1      s37
## K2         K2    0_start   40.81    start    figs START philipp1
## K20       K20 4_phi_rais  142.11  philipp raisins    T4      s36
## K21       K21 2_vin_rais   87.73  vincent raisins    T2      s14
## K22       K22 3_phi_figs   77.94  philipp    figs    T3      s41
## K23       K23    0_start   30.24    start    figs START vincent1
## K24       K24 1_vin_figs   36.59  vincent    figs    T1       s9
## K25       K25 4_phi_rais  157.40  philipp raisins    T4       s8
## K26       K26 3_phi_figs  135.71  philipp    figs    T3      s35
## K27       K27 3_phi_figs   69.05  philipp    figs    T3      s19
## K28       K28    0_start   13.75    start    figs START vincent2
## K29       K29 1_vin_figs   61.98  vincent    figs    T1      s29
## K3         K3 4_phi_rais   57.10  philipp raisins    T4      s16
## K30       K30 3_phi_figs  113.28  philipp    figs    T3      s31
## K31       K31 4_phi_rais  132.41  philipp raisins    T4      s40
## K32       K32 4_phi_rais   40.72  philipp raisins    T4      s32
## K33       K33 3_phi_figs   75.24  philipp    figs    T3       s3
## K34       K34 4_phi_rais   73.05  philipp raisins    T4      s12
## K4         K4 1_vin_figs   46.57  vincent    figs    T1      s33
## K5         K5 3_phi_figs   42.10  philipp    figs    T3       s7
## K6         K6 2_vin_rais   35.52  vincent raisins    T2      s18
## K7         K7 4_phi_rais   24.82  philipp raisins    T4       s4
## K8         K8 2_vin_rais   44.89  vincent raisins    T2      s38
## K9         K9 1_vin_figs   39.14  vincent    figs    T1       s5
##               NovoID student_Number      Family_Name First_name AllFactor
## K1  FKDN220086523-1A             10           Mottet     Leonie       ALL
## K10 FKDN220086532-1A             26         Cergneux     Julien       ALL
## K11 FKDN220086533-1A             17 Tharmakulasinkam   Karunnya       ALL
## K12 FKDN220086534-1A             20           Corset    Margaux       ALL
## K13 FKDN220086535-1A             46          Hafez 2     Daniel       ALL
## K14 FKDN220086536-1A             22             Jann  Alexandre       ALL
## K15 FKDN220086537-1A             29          Brochet     Silvia       ALL
## K16 FKDN220086538-1A             28           Ndiaye     Malick       ALL
## K17 FKDN220086539-1A             44            Engel    Philipp       ALL
## K18 FKDN220086540-1A              6      Liakopoulos     Petros       ALL
## K19 FKDN220086541-1A             37             Burz  Sebastian       ALL
## K2  FKDN220086524-1A             45            Engel    Philipp       ALL
## K20 FKDN220086542-1A             36          Matthey      Noemi       ALL
## K21 FKDN220086543-1A             14             Peng   Xiaojing       ALL
## K22 FKDN220086544-1A             41       Somerville    Vincent       ALL
## K23 FKDN220086545-1A             43       Somerville    Vincent       ALL
## K24 FKDN220086546-1A              9            Hafez     Daniel       ALL
## K25 FKDN220086547-1A              8          Gwyther     Philip       ALL
## K26 FKDN220086548-1A             35    Bonilla-Rosso     German       ALL
## K27 FKDN220086549-1A             19          Estelle     Vivien       ALL
## K28 FKDN220086550-1A             42       Somerville    Vincent       ALL
## K29 FKDN220086551-1A             29            Engel    Philipp       ALL
## K3  FKDN220086525-1A             16          Buvelot   Mathilde       ALL
## K30 FKDN220086552-1A             31        de Bakker    Vincent       ALL
## K31 FKDN220086553-1A             40          Liberti    joanito       ALL
## K32 FKDN220086554-1A             32     Garrido-Sanz     Daniel       ALL
## K33 FKDN220086555-1A              3             Moix     Samuel       ALL
## K34 FKDN220086556-1A             12          Dentand     Alexis       ALL
## K4  FKDN220086526-1A             33    Sarton-Loheac    Garance       ALL
## K5  FKDN220086527-1A              7  Garrido Marques    Antonio       ALL
## K6  FKDN220086528-1A             18        de Coning     Elindi       ALL
## K7  FKDN220086529-1A              4          Schmidt    Camille       ALL
## K8  FKDN220086530-1A             38           Gibson      Paddy       ALL
## K9  FKDN220086531-1A              5        Ben Amara      Cyril       ALL

Which command allows you to check the sample metadata from the PhyloSeq object?
> This command allows to check the metadata: meta(ps_raw).

What is added to the phyloseq object with this command line merge_phyloseq(ps_raw, dna)?
> The merge_phyloseq() function is used to add the ASV sequences to the existing ps_raw phyloseq object. The dna object, which contains the ASV sequences associated with their respective taxonomic units, is merged into the ps_raw object. This addition allows the phyloseq object to have both taxonomic information and the corresponding ASV sequences.

Explain which data are we merging with this command line table = merge(tax_table(ps_raw), t(otu_table(ps_raw)), by=“row.names”)
> The merge() function is used to merge two tables, the taxonomic table and the OTU (Operational Taxonomic Unit) table. The tax_table(ps_raw) function retrieves the taxonomic table from the ps_raw phyloseq object, while otu_table(ps_raw) retrieves the OTU table. The merging is done based on the row names, which typically represent the taxonomic identifiers or names of each taxonomic unit. The resulting merged table, stored in the table variable, combines the taxonomic information and abundance data for each taxon.

What does the phyloseq object contains in terms of data ?
> The phyloseq object contains the OTU Table (Operational Taxonomic Unit) or ASV in each sample, taxonomic table: (taxonomic information for each OTU or ASV), sample Data (sample-specific metadata) and refseq object contains the reference sequences associated with the taxonomic units.

What do you think the next steps will be? Is is useful to save the ASVs sequences?
> Yes it is useful to save the ASVs sequences for potential next steps, such as statistical analysis, visualization, or downstream analysis using the phyloseq object (ps_raw).

Have a look at the Taxtable_dada2.csv and ASV_sequences.csv files. What information can you find in them ?
> ASV_sequences.csv contains the actual nucleotide sequences of the ASVs. Taxtable_dada2.csv contains the taxonomic information associated with each ASV.

3.14 Number of reads

List each step where filtering happened in this pipeline.
> We filtered with fastQC and trimmomatic. We trimmed again based on several parameters such as maxN, maxEE, truncQ etc. We did trimmomatic, cutadapt, we filtered, denoised the forward and reverse, merged and removed the nonchim. We lost the most reads on the filtering step (see plot below).

# Loading the table 'read_counts_overview_FWD.txt' and setting the column names
preDADA2 <- read.table("/mnt/raidarray/home/SHARED/2023_SAGE2/read_counts_overview_FWD.txt", sep = "\t")
colnames(preDADA2) <- c("Sample_name", "Before", "After_trimmomatic", "After_cutadapt")

# ordering the table by the sample names   
preDADA2 <- preDADA2[order(preDADA2$Sample_name),]

getN <- function(x) sum(getUniques(x))
reads_counts <- cbind(out, sapply(dadaFs, getN), sapply(dadaRs, getN), sapply(mergers, getN), rowSums(seqtab.nochim))

colnames(reads_counts) <- c("input", "filtered", "denoisedF", "denoisedR", "merged", "nonchim")
rownames(reads_counts) <- sample.names

# Check dimensions
dim(preDADA2) ; dim(reads_counts)
## [1] 34  4
## [1] 34  6
# Check samples order
preDADA2[,1] ; rownames(reads_counts)
##  [1] "K1_1.fq.gz "  "K10_1.fq.gz " "K11_1.fq.gz " "K12_1.fq.gz " "K13_1.fq.gz "
##  [6] "K14_1.fq.gz " "K15_1.fq.gz " "K16_1.fq.gz " "K17_1.fq.gz " "K18_1.fq.gz "
## [11] "K19_1.fq.gz " "K2_1.fq.gz "  "K20_1.fq.gz " "K21_1.fq.gz " "K22_1.fq.gz "
## [16] "K23_1.fq.gz " "K24_1.fq.gz " "K25_1.fq.gz " "K26_1.fq.gz " "K27_1.fq.gz "
## [21] "K28_1.fq.gz " "K29_1.fq.gz " "K3_1.fq.gz "  "K30_1.fq.gz " "K31_1.fq.gz "
## [26] "K32_1.fq.gz " "K33_1.fq.gz " "K34_1.fq.gz " "K4_1.fq.gz "  "K5_1.fq.gz " 
## [31] "K6_1.fq.gz "  "K7_1.fq.gz "  "K8_1.fq.gz "  "K9_1.fq.gz "
##  [1] "K1"  "K10" "K11" "K12" "K13" "K14" "K15" "K16" "K17" "K18" "K19" "K2" 
## [13] "K20" "K21" "K22" "K23" "K24" "K25" "K26" "K27" "K28" "K29" "K3"  "K30"
## [25] "K31" "K32" "K33" "K34" "K4"  "K5"  "K6"  "K7"  "K8"  "K9"
# Join the dataframes
ReadsTrack <- cbind(sample.names, preDADA2[,2:4], reads_counts[,2:6])

# Compute the fraction of kept reads 
ReadsTrack$fracKept <- ReadsTrack$nonchim / ReadsTrack$Before

# Vizualize the fraction of kept reads for each samples
ggplot(ReadsTrack, aes(x = sample.names, y = fracKept))+
  geom_bar(stat = "identity") +
  theme_bw() + 
  theme(axis.text.x = element_text(angle = 90))

# Save Boxplot and read track table
boxplot(ReadsTrack[, -c(1, 10)], ylim = c(0, 120000), xlab = "Steps of the analysis", ylab = "Number of reads kept", cex = 0.5)

Figure 3. Number of reads kept after each step of the analysis. The steps are input, filtered, denoisedF, denoisedR, merged and nonchim.

# Finally save the reads count table
write.csv(ReadsTrack, paste0(PathToWD, "/Reads_tracking.csv"), row.names = FALSE)

From which package the function getUniques comes from ? Briefly, what does it produce as output?
> This function comes from the package dada2. It extracts the uniques-vector from several different data objects, including dada-class and derep-class objects, as well as data.frame. It creates an integer vector named by unique sequence and valued by abundance.

Why is the read count similar between the After_trimmomatic and After_cutadapt steps?
> The read count is similar between these two steps because their functions are quite similar, both are in charge of removing the adapters and the low-quality reads.

3.15 Track reads

div <- function(x, y){ (1 - (x/y)) * 100 }
lostReads <- div(ReadsTrack[, 3:9], ReadsTrack[, 2])
head(lostReads)
##   After_trimmomatic After_cutadapt filtered denoisedF denoisedR   merged
## 3          1.332314       1.332314 39.36528  39.92098  39.67797 41.03322
## 1          1.476246       1.476246 37.67920  38.41318  38.12078 39.89653
## 2          1.535759       1.535759 34.02447  34.61180  34.27644 35.55845
## 4          1.038442       1.038442 38.05583  38.56459  38.35372 39.51684
## 5          2.311714       2.311714 48.83288  49.00913  48.96800 49.54177
## 6          2.643139       2.643139 52.35951  52.54531  52.44564 53.10766
##    nonchim
## 3 43.05633
## 1 41.71017
## 2 37.63186
## 4 41.02556
## 5 51.19257
## 6 54.80330
averageLost <- mean(lostReads$nonchim)
lostperSpecies <- lostReads$nonchim
names(lostperSpecies) <- ReadsTrack$sample.names

averageLost
## [1] 44.40997
lostperSpecies
##       K1      K10      K11      K12      K13      K14      K15      K16 
## 43.05633 41.71017 37.63186 41.02556 51.19257 54.80330 57.12436 60.31704 
##      K17      K18      K19       K2      K20      K21      K22      K23 
## 61.48127 57.44805 53.27990 42.50547 58.77055 54.34547 58.34894 48.37818 
##      K24      K25      K26      K27      K28      K29       K3      K30 
## 49.86821 31.44042 33.30494 33.13039 26.19974 37.33019 45.27746 33.62564 
##      K31      K32      K33      K34       K4       K5       K6       K7 
## 32.88389 33.91757 33.15235 35.00574 39.90740 50.15377 49.46577 42.87434 
##       K8       K9 
## 43.79889 37.18310
min(ReadsTrack$nonchim)
## [1] 34279

What is the average lost reads ? At which step did you loose the most reads ?
> On average, we lost 44% of our reads. We lost the most reads during our trimming of the data (function filterAndTrim()).

Did a sample lose more reads than all the other ?
> No, overall, the loss is well distributed across our samples. The maximum lost is sample K17 with more than 61% of loss.

Do you think enough reads are remaining ?
> Yes, as we started with a huge number of reads we still have a significant amount. For example, the minimum number of reads for a sample is still 34280 for the K18.

4 ASV-isolate matching

Why for targeted metagenomics (or amplicon sequencing) the 16S rRNA gene is used?
> The 16S rRNA gene is commonly used in targeted metagenomics or amplicon sequencing due to its ubiquity, conserved regions, universality, and extensive database availability. It serves as a phylogenetic marker, allowing to analyze the diversity and relationships of microbial communities. Universal primers designed for this gene enable the simultaneous amplification of multiple taxa. Its evolutionary conservation facilitates comparative studies across different samples and the inference of evolutionary relationships.

Examine the file. How many 16S rRNA genes have been identified in our collection of assembled genomes?
> By looking at the fasta file, we see that we have 82 sequences. As it is forward and reverse, we have 41 16S rRNA genes that have been identified in our collection of assembled genomes.

What is a query? How many queries are in the blast output?
> The queries are each sequence in our fasta file, so we have 82 queries.

What is the difference between a query and its hits?
> A query is the sequence we give and the hits are the results, so the potential matches.

4.1 Processing BLAST results

# Load blast results
blast_res = read.table("/mnt/raidarray/home/etu19/Blast_16SrRNAs_vs_ASVs_results.txt",
                       header = F, 
                       sep = "\t", 
                       comment.char = "#")
head(blast_res)
##                                            V1      V2      V3  V4 V5 V6  V7  V8
## 1 16S_rRNA::ESL0963_contig_1:403998-405554(+)    ASV4 100.000 253  0  0 549 801
## 2 16S_rRNA::ESL0963_contig_1:403998-405554(+)   ASV17  99.605 253  1  0 549 801
## 3 16S_rRNA::ESL0963_contig_1:403998-405554(+)    ASV1  98.819 254  1  2 549 801
## 4 16S_rRNA::ESL0963_contig_1:403998-405554(+)    ASV9  98.425 254  2  2 549 801
## 5 16S_rRNA::ESL0963_contig_1:403998-405554(+) ASV1333  98.024 253  5  0 549 801
## 6 16S_rRNA::ESL0963_contig_1:403998-405554(+)  ASV548  97.638 254  4  2 549 801
##   V9 V10       V11 V12 V13
## 1  1 253 1.88e-132 468  16
## 2  1 253 8.75e-131 462  16
## 3  1 253 1.89e-127 451  16
## 4  1 253 8.82e-126 446  16
## 5  1 253 4.10e-124 440  16
## 6  1 253 1.91e-122 435  16

Do you know what comment.char = “#” does?
> comment.char specifies which character should be used as a comment character. Every line which starts with this character will be ignored during the import.

# Extract the column names
blast_cols = read.csv2("/mnt/raidarray/home/etu19/Blast_16SrRNAs_vs_ASVs_results.txt",
                  header = F, 
                  sep = ",",
                  skip = 3)[1,]

# Rename fields to avoid weird characters and spaces
blast_cols = gsub("# Fields: ", "", blast_cols)
blast_cols = gsub("^ ", "", blast_cols)
blast_cols = gsub(" ", ".", blast_cols)
blast_cols = gsub("%", "perc", blast_cols)

# Assign names to blast_res column names
colnames(blast_res) = blast_cols
head(blast_res)
##                                 query.acc.ver subject.acc.ver perc.identity
## 1 16S_rRNA::ESL0963_contig_1:403998-405554(+)            ASV4       100.000
## 2 16S_rRNA::ESL0963_contig_1:403998-405554(+)           ASV17        99.605
## 3 16S_rRNA::ESL0963_contig_1:403998-405554(+)            ASV1        98.819
## 4 16S_rRNA::ESL0963_contig_1:403998-405554(+)            ASV9        98.425
## 5 16S_rRNA::ESL0963_contig_1:403998-405554(+)         ASV1333        98.024
## 6 16S_rRNA::ESL0963_contig_1:403998-405554(+)          ASV548        97.638
##   alignment.length mismatches gap.opens q..start q..end s..start s..end
## 1              253          0         0      549    801        1    253
## 2              253          1         0      549    801        1    253
## 3              254          1         2      549    801        1    253
## 4              254          2         2      549    801        1    253
## 5              253          5         0      549    801        1    253
## 6              254          4         2      549    801        1    253
##      evalue bit.score perc.query.coverage.per.subject
## 1 1.88e-132       468                              16
## 2 8.75e-131       462                              16
## 3 1.89e-127       451                              16
## 4 8.82e-126       446                              16
## 5 4.10e-124       440                              16
## 6 1.91e-122       435                              16

Do you know what the last part of the read.csv2 command skip = 3)[1,] is doing ?
> The skip argument specifies the number of lines of the data file to skip before beginning to read data. In this case, it skips the first 3 lines. The [1,] says that we keep only the header.

What is the gsub() function doing ?
> gsub() searches for a specific substring in a string, and replaces it with another substring (specified in the arguments). In our case, it formats the headers so that they are more readable.

# Split query names into columns
blast_res$query.acc.ver = gsub("16S_rRNA::", "", blast_res$query.acc.ver)
blast_res = blast_res %>% separate(query.acc.ver, 
                                   c("Genome", "contig", "coords", "strand"),
                                   sep = "_contig_|:|\\(|\\)")

head(blast_res)
##    Genome contig        coords strand subject.acc.ver perc.identity
## 1 ESL0963      1 403998-405554      +            ASV4       100.000
## 2 ESL0963      1 403998-405554      +           ASV17        99.605
## 3 ESL0963      1 403998-405554      +            ASV1        98.819
## 4 ESL0963      1 403998-405554      +            ASV9        98.425
## 5 ESL0963      1 403998-405554      +         ASV1333        98.024
## 6 ESL0963      1 403998-405554      +          ASV548        97.638
##   alignment.length mismatches gap.opens q..start q..end s..start s..end
## 1              253          0         0      549    801        1    253
## 2              253          1         0      549    801        1    253
## 3              254          1         2      549    801        1    253
## 4              254          2         2      549    801        1    253
## 5              253          5         0      549    801        1    253
## 6              254          4         2      549    801        1    253
##      evalue bit.score perc.query.coverage.per.subject
## 1 1.88e-132       468                              16
## 2 8.75e-131       462                              16
## 3 1.89e-127       451                              16
## 4 8.82e-126       446                              16
## 5 4.10e-124       440                              16
## 6 1.91e-122       435                              16

Do you understand what the _contig_|:|\\(|\\) within the separate() function is doing ?
> This argument specifies that the value should be separated at “\contig\”, “:”, “(” and “)”. These characters will also be removed. This way, the value will be split in 4 columns: contig, coordinates, strand and the ASV id.

# Filter BLAST
filt_threshold = 100

blast_filt = filter(blast_res, perc.identity >= filt_threshold)
head(blast_filt)
##    Genome contig          coords strand subject.acc.ver perc.identity
## 1 ESL0963      1   403998-405554      +            ASV4           100
## 2 ESL0963      1 1720439-1721994      -            ASV4           100
## 3 ESL0961      1   268380-269947      +            ASV5           100
## 4 ESL0965      1   267413-268980      +            ASV5           100
## 5 ESL0967      1   267163-268730      +            ASV5           100
## 6 ESL0969      1   269573-271140      +            ASV5           100
##   alignment.length mismatches gap.opens q..start q..end s..start s..end
## 1              253          0         0      549    801        1    253
## 2              253          0         0      548    800        1    253
## 3              253          0         0      556    808        1    253
## 4              253          0         0      556    808        1    253
## 5              253          0         0      556    808        1    253
## 6              253          0         0      556    808        1    253
##      evalue bit.score perc.query.coverage.per.subject
## 1 1.88e-132       468                              16
## 2 1.88e-132       468                              16
## 3 1.89e-132       468                              16
## 4 1.89e-132       468                              16
## 5 1.89e-132       468                              16
## 6 1.89e-132       468                              16
# Count the different ASVs
length(unique(blast_filt$subject.acc.ver))
## [1] 10

Can you count how many different ASVs matched at a 100% sequence identity our assembled genomes ?
> There are 10 different ASVs that matched at a 100% sequence identity with our assembled genomes.

# Create a dataframe with 3 columns
genome_tax = data.frame("Genome"=c("ESL0961", "ESL0965", "ESL0967", "ESL0969",
                                   "ESL0962", "ESL0970", "ESL0968_iso_01", 
                                   "ESL0976", "ESL0964", "ESL0966", "ESL0968_iso_02", 
                                   "ESL0971", "ESL0972", "ESL0974", "ESL0963"),
                        "Copy.no.16S" = c("5","5","5","5","5","5","5","5",
                                      "6", "6", "6", "6", "6", "6", "6"),
                        "Taxa"=c("Lacticaseibacillus paracasei", 
                                 "Lacticaseibacillus paracasei",
                                 "Lacticaseibacillus paracasei", 
                                 "Lacticaseibacillus paracasei",
                                 "Lentilactobacillus parabuchneri", 
                                 "Lentilactobacillus parabuchneri", 
                                 "Lentilactobacillus parabuchneri",
                                 "Lentilactobacillus hilgardii", 
                                 "Liquorilactobacillus nagelii", 
                                 "Liquorilactobacillus nagelii", 
                                 "Liquorilactobacillus nagelii", 
                                 "Liquorilactobacillus nagelii", 
                                 "Liquorilactobacillus nagelii", 
                                 "Liquorilactobacillus nagelii", 
                                 "Liquorilactobacillus ghanensis"))
head(genome_tax)
##    Genome Copy.no.16S                            Taxa
## 1 ESL0961           5    Lacticaseibacillus paracasei
## 2 ESL0965           5    Lacticaseibacillus paracasei
## 3 ESL0967           5    Lacticaseibacillus paracasei
## 4 ESL0969           5    Lacticaseibacillus paracasei
## 5 ESL0962           5 Lentilactobacillus parabuchneri
## 6 ESL0970           5 Lentilactobacillus parabuchneri
# Add this information to the blast results
blast_filt = left_join(blast_filt, genome_tax, by="Genome")
head(blast_filt)
##    Genome contig          coords strand subject.acc.ver perc.identity
## 1 ESL0963      1   403998-405554      +            ASV4           100
## 2 ESL0963      1 1720439-1721994      -            ASV4           100
## 3 ESL0961      1   268380-269947      +            ASV5           100
## 4 ESL0965      1   267413-268980      +            ASV5           100
## 5 ESL0967      1   267163-268730      +            ASV5           100
## 6 ESL0969      1   269573-271140      +            ASV5           100
##   alignment.length mismatches gap.opens q..start q..end s..start s..end
## 1              253          0         0      549    801        1    253
## 2              253          0         0      548    800        1    253
## 3              253          0         0      556    808        1    253
## 4              253          0         0      556    808        1    253
## 5              253          0         0      556    808        1    253
## 6              253          0         0      556    808        1    253
##      evalue bit.score perc.query.coverage.per.subject Copy.no.16S
## 1 1.88e-132       468                              16           6
## 2 1.88e-132       468                              16           6
## 3 1.89e-132       468                              16           5
## 4 1.89e-132       468                              16           5
## 5 1.89e-132       468                              16           5
## 6 1.89e-132       468                              16           5
##                             Taxa
## 1 Liquorilactobacillus ghanensis
## 2 Liquorilactobacillus ghanensis
## 3   Lacticaseibacillus paracasei
## 4   Lacticaseibacillus paracasei
## 5   Lacticaseibacillus paracasei
## 6   Lacticaseibacillus paracasei

Do you know what the left_join() function is doing? How the two data.frames (blast_res and genome_tax) are being combined ?
> The left_join() function is taking all the common genomes between the two data frames, getting the corresponding lines from the right dataframe (all columns) and pasting these columns to the left dataframe.

# ASVs that matched at 100%

plot_match = ggplot(blast_filt, aes(y=paste0(Taxa," ",Genome))) +
  geom_bar(aes(fill=subject.acc.ver)) + #ASVs will be represented as bars
  geom_point(aes(x=-1.5, color=Taxa), size=2.5) + #points will represent the genomes taxa
  geom_text(aes(x=-3.2), label=blast_filt$Copy.no.16S)+ #number of 16S rRNA copies per genome
  ggtitle("Isolate-ASV matching")+
  #Color scale for the Species based on assembled genomes
  scale_color_manual(values=c("coral", 
                                "bisque4",
                                "lightgoldenrod2",
                                "aquamarine3",
                                "deepskyblue3"))+
  #Color scale for the ASVs matching
  scale_fill_d3("category20c")+
  #ticks, breaks and labels of the x scale
  scale_x_continuous(position = "top",
                     breaks = c(-3.2,0,2,4,6,8,10,15,20),
                     labels = c("16S rRNA \ncopies",0,2,4,6,8,10,15,20)
                     ) +
  #Title of X axis
  xlab(paste0(filt_threshold, "% ASV match count")) + 
  #titles for the legend.
  labs(fill="ASV", color="Species (whole-genome)") +
  #theme and theme options
  theme_classic() + 
  theme(axis.title.y = element_blank(), 
        axis.line = element_blank(),
        axis.ticks.y = element_blank(),
        axis.text.x.top = element_text(),
        legend.key.size = unit(0.4, 'cm'),
        legend.key.height = unit(0.4, 'cm'),
        legend.key.width = unit(0.4, 'cm'))

plot_match

# Specific ASV matching each of the 16S rRNa copies

ggplot(blast_filt, aes(y=paste0(Taxa," ",Genome," ", coords))) +
  geom_bar(aes(fill=subject.acc.ver)) +
  geom_point(aes(x=-0.2, color=Genome), size=2, shape=15) +
  scale_color_d3("category20")+
  scale_fill_d3("category20c") +
  scale_x_continuous(position = "top")+
  xlab(paste0(filt_threshold, "% ASV match count")) + 
  labs(fill="ASV", color="Genome") +
  theme_classic() + 
  theme(axis.title.y = element_blank(), 
        axis.line = element_blank(),
        axis.ticks.y = element_blank(),
        axis.text.x.top = element_text(),
        axis.text.y = element_text(size = 5),
        legend.key.size = unit(0.3, 'cm'),
        legend.key.height = unit(0.3, 'cm'),
        legend.key.width = unit(0.3, 'cm'))

Figure 4. For each genome, the number of ASV matching at 100% and which species they are representing.

Why most genomes have multiple ASVs matching ?
> Because they have many copies of 16S rRNA.

How can you explain that L. paracasei genomes match 8 ASV while these genomes only contain 5 copies of the 16S rRNA gene? and others like L. nagelii that have 6 copies of the 16S rRNA match with less ?
> This is because of the BLAST results, where a contig can match different ASVs with 100%. This is due to several genes containing the same sequnence. It can also be that one ASV is 1 nucelotide longer. Furthermore, if we lower the threshold (99% match for example), the number of matches increases. We notice that ASVs are not the whole gene but just a part of it. therefore less precise.

4.2 Abundance of isolates

ps <- readRDS("/mnt/raidarray/home/etu19/DADA2/04_Phyloseq_object/PhyloSeq_Object.rds")
ps
## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 5066 taxa and 34 samples ]
## sample_data() Sample Data:       [ 34 samples by 12 sample variables ]
## tax_table()   Taxonomy Table:    [ 5066 taxa by 7 taxonomic ranks ]
## refseq()      DNAStringSet:      [ 5066 reference sequences ]
ncol(ps@otu_table)
## [1] 5066

From the above output (executing directly the name of the phyloseq object), how many ASVs are there? How many samples ?
> There are 5066 ASVs in our PhyloSeq object, and there are 34 samples.

ps@otu_table[1:5,1:5]
## OTU Table:          [5 taxa and 5 samples]
##                      taxa are columns
##      ASV1  ASV2 ASV3  ASV4 ASV5
## K1  24232 19093 3937  3337 2805
## K10 12045  8152 2716  3604 1970
## K11 11960 34964 4263  2762 2951
## K12 11729 11190 1675 10797 1467
## K13 11792  1136 1064   858  737

The otu_table contains raw sequence numbers or relative abundance ?
> The otu_table contains raw sequence numbers.

# Convert into relative abundance

ps_ra = transform_sample_counts(ps, function(x) x/sum(x))
ps_ra@otu_table[1:5,1:5]
## OTU Table:          [5 taxa and 5 samples]
##                      taxa are columns
##          ASV1       ASV2       ASV3       ASV4       ASV5
## K1  0.3615745 0.28489361 0.05874541 0.04979259 0.04185443
## K10 0.2446381 0.16556991 0.05516289 0.07319847 0.04001137
## K11 0.1701329 0.49736835 0.06064184 0.03928988 0.04197843
## K12 0.1664136 0.15876619 0.02376527 0.15319022 0.02081412
## K13 0.2365496 0.02278837 0.02134403 0.01721163 0.01478435
# Look at the top 250 ASVs

toptax = names(sort(taxa_sums(ps), decreasing=TRUE))[1:250]
ps.toptax = transform_sample_counts(ps, function(x) {x/sum(x)})
ps.toptax <- prune_taxa(toptax, ps.toptax)

plot_bar(ps.toptax, x="id_sample", fill="Phylum") +
  ylab("Relative abundance") + xlab("") +
  ggtitle("Phylum level", "Top 250 ASVs") +
  theme(legend.key.size = unit(0.3, 'cm'), #change legend key size
        legend.key.height = unit(0.3, 'cm'), #change legend key height
        legend.key.width = unit(0.3, 'cm'), #change legend key width
        legend.title = element_blank(), #remove legend title font size
        legend.text = element_text(size=7), #change legend text font size
        legend.position = "top", #change legend position
        axis.text.x = element_text(vjust = 0.5))

Figure 5. Top 250 ASVs, showing which phylum they represent and in which abundance, for each sample.

Which phyla dominate the water kefir samples? Do you observe dramatic changes across samples ?
> The Firmicutes dominate the water kefir samples. There are no dramatic changes, although Proteobacteria and Actinobacteriota can get up to 50 and 40% respectively.

# Retrieve a list of the exact ASVs
matching_asvs = unique(blast_filt$subject.acc.ver)
matching_asvs
##  [1] "ASV4"    "ASV5"    "ASV3"    "ASV8"    "ASV17"   "ASV1"    "ASV9"   
##  [8] "ASV2"    "ASV4438" "ASV16"
# Filter
ps_raASVs = prune_taxa(colnames(otu_table(ps_ra)) %in% matching_asvs, ps_ra)
ps_raASVs
## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 10 taxa and 34 samples ]
## sample_data() Sample Data:       [ 34 samples by 12 sample variables ]
## tax_table()   Taxonomy Table:    [ 10 taxa by 7 taxonomic ranks ]
## refseq()      DNAStringSet:      [ 10 reference sequences ]

What is the colnames(otu_table(ps_ra)) %in% matching_asvs doing ?
> This code is checking which ASVs are present in matching_asvs, and keeping them for the prune_taxa() function.

# Add info to phyloseq object
tax = as.data.frame(ps_raASVs@tax_table)
tax$ASV = rownames(tax)

#Add tax again to the phyloseq object
tax_table(ps_raASVs) = as.matrix(tax)
# Plot at the species level
sp_plot = plot_bar(ps_raASVs, x="id_sample", fill="Species") +
  ylab("Relative abundance") + xlab("") +
  ggtitle("Species level", paste0(filt_threshold, "% ASV matching")) +
  theme(legend.key.size = unit(0.3, 'cm'), #change legend key size
        legend.key.height = unit(0.3, 'cm'), #change legend key height
        legend.key.width = unit(0.3, 'cm'), #change legend key width
        legend.title = element_blank(), #remove legend title font size
        legend.text = element_text(size=7), #change legend text font size
        legend.position = "top", #change legend position
        axis.text.x = element_text(vjust = 0.5)) #vertically center x labels

# Plot at the genus level
gen_plot = plot_bar(ps_raASVs, x="id_sample", fill="Genus") +
  ylab("Relative abundance") + xlab("") +
  ggtitle("Genus level", paste0(filt_threshold, "% ASV matching")) +
  theme(legend.key.size = unit(0.3, 'cm'),
        legend.key.height = unit(0.3, 'cm'),
        legend.key.width = unit(0.3, 'cm'),
        legend.title = element_blank(),
        legend.text = element_text(size=7),legend.position = "top")

ggarrange(sp_plot, #plot 1
          gen_plot + theme(axis.title.y = element_blank()), #plot2 
          ncol = 2, 
          align = "h",
          labels = "AUTO") #puts the A, B... of the panels.

Do the genomes that we obtained during SAGE I represent a big part of the community ?
> Yes they represent a big part, because the relative abundance is close to 100%.

How many of the matching ASVs are classified at the Species level? Do you observe the 5 different species identified in the collection of assembled genomes ?
> There are 2 matching ASVs: L. paracasei and L. ghanensis. No, we don’t observe the 5 different species identified in the collection of assembled genome.

Check also the classification of the matching ASVs at the Genus level. What do you see? Do you see the 3 different genera identified in the collection of the assembled genomes ?
> We only have Lactobacillus and Reyranella, but the latter is almost not present.

How can you explain the differences observed in the taxonomy of ASVs at the genus and species levels compared with the taxonomy of the assembled genomes ?
> The difference in taxonomy between ASVs and assembled genomes is due to the smaller resolution of ASVs, compared to genomes. So, there can be sequence errors and artifacts. Of course, ASVs do not provide a complete genome representation, so variation in other genomic regions can be missed.

# Coloring by ASV

asv_plot <- plot_bar(ps_raASVs, x="id_sample", fill="ASV") +
  ylab("Relative abundance") +
  ggtitle("ASV level", paste0(filt_threshold, "% ASV matching")) +
  theme(legend.key.size = unit(0.3, 'cm'),
        legend.key.height = unit(0.3, 'cm'), 
        legend.key.width = unit(0.3, 'cm'), 
        legend.title = element_blank(), 
        legend.text = element_text(size=7),
        legend.position = "top")

asv_plot

Can you generate a plot with ggarrange() containing the three plots? at the ASV level, Species level and Genus level ?
> See code and plot below.

ggarrange(sp_plot,
          gen_plot + theme(axis.title.y = element_blank()), 
          asv_plot + theme(axis.title.y = element_blank()),
          ncol = 3,
          align = "h",
          labels = "AUTO")

Figure 6. Top 250 ASVs matching at 100%, at the species, genus and ASV level.

# Filtering BLAST results

ASV_reclass = blast_filt %>% select(Taxa, "ASV"=subject.acc.ver) %>% unique()
ASV_reclass
##                               Taxa     ASV
## 1   Liquorilactobacillus ghanensis    ASV4
## 3     Lacticaseibacillus paracasei    ASV5
## 8     Lacticaseibacillus paracasei    ASV3
## 9     Lacticaseibacillus paracasei    ASV8
## 17  Liquorilactobacillus ghanensis   ASV17
## 35    Liquorilactobacillus nagelii    ASV1
## 62    Liquorilactobacillus nagelii    ASV9
## 70    Lentilactobacillus hilgardii    ASV2
## 75    Liquorilactobacillus nagelii ASV4438
## 76 Lentilactobacillus parabuchneri   ASV16
# Reassign the taxonomy
tax2 = left_join(tax, ASV_reclass, by="ASV")
rownames(tax2) = tax2$ASV

# Add tax again to the phyloseq object
tax_table(ps_raASVs) = as.matrix(tax2)
ps_raASVs
## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 10 taxa and 34 samples ]
## sample_data() Sample Data:       [ 34 samples by 12 sample variables ]
## tax_table()   Taxonomy Table:    [ 10 taxa by 9 taxonomic ranks ]
## refseq()      DNAStringSet:      [ 10 reference sequences ]
# Plot
reclass_ASVs = plot_bar(ps_raASVs, x="id_sample", fill="Taxa") +
  ylab("Relative abundance") +
  ggtitle("Reclassified ASVs", paste0(filt_threshold, "% ASV matching")) +
  scale_fill_manual(values=c("coral", 
                                "bisque4",
                                "lightgoldenrod2",
                                "aquamarine3",
                                "deepskyblue3"))
reclass_ASVs

What are the differences you observe with this plot with reassignation of ASVs taxonomy based on whole-genome information compared to the previous ones based on the taxonomic classification of ASVs agains the SILVA database ?
> We found 5 different species based on their whole genome, whereas before we were only able to find 1 or 2.

What is causing these differences ?
> We can find information outside of the 16S which allows for better classification and taxonomy.

reclass_type <- plot_bar(ps_raASVs, x="label", fill="Taxa") +
  ylab("Relative abundance") +
  ggtitle("Reclassified ASVs") +
  scale_fill_manual(values=c("coral", 
                                "bisque4",
                                "lightgoldenrod2",
                                "aquamarine3",
                                "deepskyblue3"))+
  facet_wrap(~inoculum+fruit, scales = "free_x", nrow = 1)
reclass_type

Figure 7. Reclassified ASVs, at the species level, grouped by the origin and food of the species.

Can you create a composite figure showing the reclassification of ASVs at these two different thresholds (100% and 99%) ?
> See code and plots below.

# Filter BLAST
filt_threshold = 99
blast_filt_99 = filter(blast_res, perc.identity >= filt_threshold)
# Count the different ASVs
length(unique(blast_filt_99$subject.acc.ver))
## [1] 13
# Create a dataframe with 3 columns
genome_tax_99 = data.frame("Genome"=c("ESL0961", "ESL0965", "ESL0967", "ESL0969",
                                   "ESL0962", "ESL0970", "ESL0968_iso_01", 
                                   "ESL0976", "ESL0964", "ESL0966", "ESL0968_iso_02", 
                                   "ESL0971", "ESL0972", "ESL0974", "ESL0963"),
                        "Copy.no.16S" = c("5","5","5","5","5","5","5","5",
                                          "6", "6", "6", "6", "6", "6", "6"),
                        "Taxa"=c("Lacticaseibacillus paracasei", 
                                 "Lacticaseibacillus paracasei",
                                 "Lacticaseibacillus paracasei", 
                                 "Lacticaseibacillus paracasei",
                                 "Lentilactobacillus parabuchneri", 
                                 "Lentilactobacillus parabuchneri", 
                                 "Lentilactobacillus parabuchneri",
                                 "Lentilactobacillus hilgardii", 
                                 "Liquorilactobacillus nagelii", 
                                 "Liquorilactobacillus nagelii", 
                                 "Liquorilactobacillus nagelii", 
                                 "Liquorilactobacillus nagelii", 
                                 "Liquorilactobacillus nagelii", 
                                 "Liquorilactobacillus nagelii", 
                                 "Liquorilactobacillus ghanensis"))
# Add this information to the blast results
blast_filt_99 = left_join(blast_filt_99, genome_tax_99, by="Genome")
head(blast_filt_99)
##    Genome contig          coords strand subject.acc.ver perc.identity
## 1 ESL0963      1   403998-405554      +            ASV4       100.000
## 2 ESL0963      1   403998-405554      +           ASV17        99.605
## 3 ESL0963      1 1720439-1721994      -            ASV4       100.000
## 4 ESL0963      1 1720439-1721994      -           ASV17        99.605
## 5 ESL0961      1   268380-269947      +            ASV5       100.000
## 6 ESL0961      1   268380-269947      +            ASV3        99.605
##   alignment.length mismatches gap.opens q..start q..end s..start s..end
## 1              253          0         0      549    801        1    253
## 2              253          1         0      549    801        1    253
## 3              253          0         0      548    800        1    253
## 4              253          1         0      548    800        1    253
## 5              253          0         0      556    808        1    253
## 6              253          1         0      556    808        1    253
##      evalue bit.score perc.query.coverage.per.subject Copy.no.16S
## 1 1.88e-132       468                              16           6
## 2 8.75e-131       462                              16           6
## 3 1.88e-132       468                              16           6
## 4 8.75e-131       462                              16           6
## 5 1.89e-132       468                              16           5
## 6 8.82e-131       462                              16           5
##                             Taxa
## 1 Liquorilactobacillus ghanensis
## 2 Liquorilactobacillus ghanensis
## 3 Liquorilactobacillus ghanensis
## 4 Liquorilactobacillus ghanensis
## 5   Lacticaseibacillus paracasei
## 6   Lacticaseibacillus paracasei
# Retrieve a list of the exact ASVs
matching_asvs_99 = unique(blast_filt_99$subject.acc.ver)
matching_asvs_99
##  [1] "ASV4"    "ASV17"   "ASV5"    "ASV3"    "ASV8"    "ASV1"    "ASV9"   
##  [8] "ASV2"    "ASV366"  "ASV25"   "ASV3323" "ASV4438" "ASV16"
# Filter
ps_raASVs_99 = prune_taxa(colnames(otu_table(ps_ra)) %in% matching_asvs_99, ps_ra)
ps_raASVs_99
## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 13 taxa and 34 samples ]
## sample_data() Sample Data:       [ 34 samples by 12 sample variables ]
## tax_table()   Taxonomy Table:    [ 13 taxa by 7 taxonomic ranks ]
## refseq()      DNAStringSet:      [ 13 reference sequences ]
# Add info to phyloseq object
tax_99 = as.data.frame(ps_raASVs_99@tax_table)
tax_99$ASV = rownames(tax_99)
#Add tax again to the phyloseq object
tax_table(ps_raASVs_99) = as.matrix(tax_99)
# Filtering BLAST results
ASV_reclass_99 = blast_filt_99 %>% select(Taxa, "ASV"=subject.acc.ver) %>% unique()
# Reassign the taxonomy
tax2_99 = left_join(tax_99, ASV_reclass_99, by="ASV")
rownames(tax2_99) = tax2_99$ASV
# Add tax again to the phyloseq object
tax_table(ps_raASVs_99) = as.matrix(tax2_99)
# Plot
reclass_ASVs_99 = plot_bar(ps_raASVs_99, x="id_sample", fill="Taxa") +
    ylab("Relative abundance") +
    ggtitle("Reclassified ASVs", paste0(filt_threshold, "% ASV matching")) +
    scale_fill_manual(values=c("coral", 
                               "bisque4",
                               "lightgoldenrod2",
                               "aquamarine3",
                               "deepskyblue3"))
ggarrange(reclass_ASVs, #plot 1
          reclass_ASVs_99 + theme(axis.title.y = element_blank()), #plot2 
          ncol = 2,
          labels = "AUTO",
          common.legend = TRUE, 
          legend = "right")

reclass_type_99 <- plot_bar(ps_raASVs_99, x="label", fill="Taxa") +
    ylab("Relative abundance") +
    ggtitle("Reclassified ASVs") +
    scale_fill_manual(values=c("coral", 
                               "bisque4",
                               "lightgoldenrod2",
                               "aquamarine3",
                               "deepskyblue3"))+
    facet_wrap(~inoculum+fruit, scales = "free_x", nrow = 1)

ggarrange(reclass_type, #plot 1
          reclass_type_99 + theme(axis.title.y = element_blank()), #plot2 
          ncol = 2, 
          align = "h",
          labels = "AUTO",
          common.legend = TRUE, 
          legend = "right")

Figure 8. Comparison of reclassified ASVs, at a 100% (A) and 99% (B) matching. On the bottom figure, it is grouped by treatment.

5 Diversity analyses

5.1 Filtering

rank_names(ps)
## [1] "Kingdom" "Phylum"  "Class"   "Order"   "Family"  "Genus"   "Species"
get_taxa_unique(ps,"Kingdom")
## [1] "Bacteria" "Archaea"
ps <- subset_taxa(ps, Kingdom != "Archaea" | is.na(Kingdom))
ps <- subset_taxa(ps, Order != "Chloroplast" | is.na(Order))
ps <- subset_taxa(ps, Family != "Mitochrondria" | is.na(Family))

5.2 Rarefaction

pl.rare.all <- ggrare(ps, step = 1000, label = "Sample", se = FALSE) + xlab("Sequencing Depth") + ylab("ASV Richness")

Figure 9. Rarefaction plot of all the samples.

How well-sampled are our samples? Would you recommend more sequencing? Or would you recommend less sequencing depth next time?
> There are well-sampled because they all reach a plateau. Sequencing depth is good, as it allow for the ASVs with a low species richness to have a sufficient sequencing depth to be able to compare them with ASVs that have a high species richness.

How heterogeneous is the sequencing depth?
> It is not too heterogeneous, with the lowest sequencing depth being around 63’000 and the highest around 100’000.

How fast are the curves flattening? What does this mean?
> Some of then are flattening really fast, like the K15 and K24 (flatten at 10’000). On the opposite side, samples with a higher species richness flatten later, like K34 (flattens at 75’000). It means that there are less diversity in the ones that flatten quickly. So, adding ASVs doesn’t increase the diversity.

Can we use these samples like this? How could it possibly affect our analyses?
> Yes, we can use the samples like this, because the smaller samples (K6 and K15) have a sequencing depth at which all other samples have reached their plateau so there is no need to rarefy. Otherwise we would lose information.

5.3 Prevalence

# Calculate prevalence across all samples
ps.prev <- estimate_prevalence(ps,group="AllFactor",rarefy = FALSE)
ps.prev$samplePrev <- ps.prev$prevalence * 34 # Convert to absolute sample numbers
ps.prev$relAbund <- ( ps.prev$abundance / sum(ps.prev$abundance) ) # Calculate relative abundance 

#Plot
pl.prev.his <- ggplot(ps.prev, aes(x=samplePrev))+
  geom_histogram(colour="steelblue",fill="steelblue", binwidth = 1, boundary = -0.5) +
  scale_x_continuous(breaks = 1:34) +
  xlab("Number of samples in which ASVs are found") +
  ylab("Number of ASVs") +
  stat_bin(binwidth=1, geom='text', colour="black", size = 2, aes(label=..count..), position=position_stack(vjust = 1.1))
pl.prev.his

## Compare prevalence with relative abundance
# select the "Phylum" taxonomic assignation for your ASVs
ps.tax.class <- data.frame(tax_table(ps)[,2]) 
# Plot
pl.prev.sct <- ggplot(ps.prev, aes(x=samplePrev, y=relAbund, color=ps.tax.class$Phylum)) +
  geom_point() +
  theme(legend.title=element_blank(),legend.text = element_text(size = 8),
        legend.position="bottom") +
  xlab("Number of samples in which ASVs are found") +
  ylab("Relative abundance")
pl.prev.sct

Figure 10. Number of samples in which ASVs are found, y-axis is the number of ASVs (top) and the relative abundance (bottom).

Does the majority of the ASVs have high or low prevalence?
> The majority of ASVs has a low prevalence. The prevalence is the number of sample where a ASV is found. We have very few ASV that are in lots of sample. We have 3’000 ASV that are in only one sample (contamination, misassignation, chimera) these are not super meaningful. They are called singleton. For waterkefir, all originated from the same culture.

Are there any ASVs occurring in all samples? Were we expecting it?
> There are 9 ASVs occuring in all 34 samples. As they all come from the same culture at the beginning we do expect that.

Are the most abundant ASVs dominating a single or few samples or are they present in all samples?
> We have 5 or 10 ASV that are in all sample. The ones that are present in all sample we expect them to be highly abundant. As we expected the ones that are present in only one sample have very low abundance. And the one the are present in all sample are the most abundant, which is nice, they all belong to the same phylum. This is expected as it is the dominating genus.

5.4 Richness

# Calculate how many total ASVs 
tmp_df <- t(otu_table(ps))
ps.asv <- data.frame(colSums(tmp_df!=0))
colnames(ps.asv) <- "ASVs"
rm(tmp_df)

# Now add the total read number
ps.asv$reads <- sample_sums(ps)
colnames(ps.asv) <- c("ASVs", "reads")
ps.asv$sampleID <- rownames(ps.asv)

# Make barplots for each sample:
pl.asv.bar <- ggplot(ps.asv, aes(x = sampleID, y = ASVs) ) + 
  ggtitle("Total ASVs per sample") + 
  ylab("ASV number") +
  geom_bar(stat="identity", fill="steelblue")+
  theme(axis.text.x = element_text(angle=90))

pl.red.bar <- ggplot(ps.asv, aes(x = sampleID, y = reads) ) + 
  ggtitle("Total reads per sample") + 
  ylab("Read number") +
  geom_bar(stat="identity", fill="indianred")+
  theme(axis.text.x = element_text(angle=90))

# Histogram of ASVs and read counts
pl.asv.his <- ggplot(ps.asv, aes(x = reads)) + 
  geom_histogram(color = "black", fill = "indianred", binwidth = 5000) +
  ggtitle("Total reads distribution") + 
  xlab("Reads") +
  ylab("Samples")+
  theme(axis.text.x = element_text(angle=90))

pl.red.his <- ggplot(ps.asv, aes(x = ASVs)) + 
  geom_histogram(color = "black", fill = "steelblue", binwidth = 50) +
  ggtitle("Total ASV distribution") + 
  xlab("ASVs") +
  ylab("Samples")+
  theme(axis.text.x = element_text(angle=90))

# Visualize
ggarrange(pl.red.bar, pl.asv.bar, pl.asv.his, pl.red.his, ncol=2, nrow=2)

Figure 11. Visualisation of the number of reads and ASVs per sample, and the number of samples per read and per ASVs.

summary(ps.asv)
##       ASVs           reads         sampleID        
##  Min.   : 52.0   Min.   :34228   Length:34         
##  1st Qu.:106.5   1st Qu.:46330   Class :character  
##  Median :202.5   Median :55012   Mode  :character  
##  Mean   :238.4   Mean   :56753                     
##  3rd Qu.:318.8   3rd Qu.:67024                     
##  Max.   :599.0   Max.   :84590

What can you learn from these plots?
> In red, we can see the total of reads per sample and the distribution of these reads. In blue, we can see the total of ASVs per sample and the distribution.

How variable is the total ASV richness across samples?
> The number of ASVs varies between 52 and 599, with a mean of 238.4, so we see that it is highly variable.

summary(lm(ps.asv$ASVs ~ ps.asv$reads))
## 
## Call:
## lm(formula = ps.asv$ASVs ~ ps.asv$reads)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -163.72  -67.53  -21.64   61.54  218.60 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -2.659e+02  7.559e+01  -3.518  0.00133 ** 
## ps.asv$reads  8.886e-03  1.298e-03   6.848 9.56e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 99.44 on 32 degrees of freedom
## Multiple R-squared:  0.5944, Adjusted R-squared:  0.5817 
## F-statistic: 46.89 on 1 and 32 DF,  p-value: 9.56e-08
pl.asv_red.scp <- ggplot(ps.asv, aes(x=reads, y=ASVs)) + 
  geom_point()+
  geom_smooth(method=lm)

## Compare ASVs against DNA yields
# Add DNA yields from the sample metadata
tmp1 <- sample_data(ps)[,"DNAconc"]
ps.asv2 <- merge(ps.asv, tmp1, by="row.names", all=TRUE)  
rownames(ps.asv2) <- ps.asv2$sampleID
ps.asv2[,"Row.names"] <- NULL

# Compare ASVs against DNA yield
summary(lm(ps.asv2$ASVs ~ ps.asv2$DNAconc))
## 
## Call:
## lm(formula = ps.asv2$ASVs ~ ps.asv2$DNAconc)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -186.05 -131.73  -34.20   81.03  360.82 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     227.7023    47.4950   4.794 3.61e-05 ***
## ps.asv2$DNAconc   0.1434     0.5269   0.272    0.787    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 156 on 32 degrees of freedom
## Multiple R-squared:  0.002309,   Adjusted R-squared:  -0.02887 
## F-statistic: 0.07405 on 1 and 32 DF,  p-value: 0.7873
pl.asv_dna.scp <- ggplot(ps.asv2, aes(x=DNAconc, y=ASVs)) + 
  geom_point()+
  geom_smooth(method=lm)
rm(tmp1, tmp2)

# Visualize
ggarrange(pl.asv_dna.scp, pl.asv_red.scp, ncol=2, nrow=1)
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'

Figure 12. Linear regressions umber of ASVs as a function of DNA extraction yields (left), and sequencing depth (right)

Is the number of ASVs biased by the DNA extraction yields? Is it statistically significant?
> No, it is not biased because the points are randomly distributed and not necessarily around the blue line. We can say our regression does not find a trend in our datapoints. Moreover, the p-value is not significant (p = 0.787).

Is the number of ASVs biased by the sequencing depth? Is it statistically significant?
> Yes, it is biased by the sequencing depth. The points are concentrated around the line. The p-value is significant (p = 9.56e-08). So the number of ASVs is correlated with the sequencing depth.

5.5 Composition

## First lets count how many Phyla are in each sample
get_taxa_unique(ps,taxonomic.rank = "Phylum")
##  [1] "Firmicutes"                    "Actinobacteriota"             
##  [3] "Proteobacteria"                "Desulfobacterota"             
##  [5] "Bacteroidota"                  "Verrucomicrobiota"            
##  [7] "Cyanobacteria"                 "Deferribacterota"             
##  [9] "Marinimicrobia (SAR406 clade)" "Spirochaetota"                
## [11] NA                              "Campilobacterota"             
## [13] "SAR324 clade(Marine group B)"  "Acidobacteriota"              
## [15] "Chloroflexi"                   "Planctomycetota"              
## [17] "Myxococcota"                   "Fibrobacterota"               
## [19] "NB1-j"                         "Nitrospirota"                 
## [21] "Abditibacteriota"              "Gemmatimonadota"              
## [23] "Dadabacteria"                  "Armatimonadota"               
## [25] "Bdellovibrionota"              "Entotheonellaeota"            
## [27] "Fusobacteriota"                "Methylomirabilota"            
## [29] "Nitrospinota"                  "Synergistota"                 
## [31] "RCP2-54"                       "PAUC34f"                      
## [33] "Elusimicrobiota"               "WPS-2"                        
## [35] "LCP-89"                        "FCPU426"                      
## [37] "Latescibacterota"              "Deinococcota"                 
## [39] "Patescibacteria"               "Deferrisomatota"              
## [41] "Cloacimonadota"                "Acetothermia"                 
## [43] "Dependentiae"                  "Calditrichota"                
## [45] "Caldatribacteriota"            "AncK6"                        
## [47] "Margulisbacteria"              "Sumerlaeota"
# Let''s look at the distribution of phyla in each sample
ps.phy <- tax_glom(ps, taxrank="Phylum")
ps.phy.df <- psmelt(ps.phy)
# Plot each treatment group next to each other
ps.phy.df %>%
  ggplot(aes_string(x = "Sample", y = "Abundance", fill = "Phylum")) +
  geom_bar(stat = "identity", position = "stack") + 
  theme_bw() + 
  theme(legend.position = "bottom", axis.text.x =  element_text(angle = 80, vjust = 0.5)) +
  ggtitle("Relative abundance at the phylum level")

K22_df <- ps.phy.df[ps.phy.df$id_sample == "K22",]

nrow(K22_df)
## [1] 47
# Now let's do the same thing at the genus level, displaying only the 20 most abundant genera
# select top 10 most prevalent Families over the whole dataset
ps.gen <- tax_glom(ps, taxrank="Genus")
Top10_gen <- names(sort(taxa_sums(ps.gen), TRUE)[1:10])
ps.gen <- prune_taxa(Top10_gen, ps.gen)
# Melt them for plotting
ps.gen.df <- psmelt(ps.gen)

# Plot each treatment group next to each other
ps.gen.df %>%
  ggplot(aes_string(x = "Sample", y = "Abundance", fill = "Genus")) +
  geom_bar(stat = "identity", position = "stack") + 
  theme_bw() + 
  theme(legend.position = "bottom", axis.text.x =  element_text(angle = 80, vjust = 0.5)) +
  ggtitle("Relative abundance at the Genus level")

Figure 13. Relative abundance for every sample, at the phylum and genus level.

Are there any dominant taxon?
> The dominant taxon is Firmicutes at the phylum level, and it is Lactobacillus at the genus level.

Is this similar to what has been reported in the literature?
> Yes, it is similar to what has been reported in the literature.

5.6 Filtering

library(genefilter)
ps.prop = transform_sample_counts(ps, function(x) {x/sum(x)})
ps.filt = filter_taxa(ps.prop, filterfun(kOverA(1, 0.01)), TRUE)

# Then save an object with the same ASVs but with raw counts
toKeep <- taxa_names(ps.filt)
ps.rawfilt <- prune_taxa(toKeep, ps)

# Fix tax_table
tmptax <- data.frame(tax_table(ps.filt))
tmptax$Linnaeus <- paste(tmptax$Genus,tmptax$Species)
tax_table(ps.filt) <- as.matrix(tmptax)
tax_table(ps.rawfilt) <- as.matrix(tmptax)

# Plot again at Genus level
plot_bar(ps.filt, x="id_sample",fill="Genus")

plot_bar(ps.rawfilt, x="id_sample",fill="Genus")

# Plot at species level
plot_bar(ps.filt, x="id_sample",fill="Linnaeus" ) +
  theme(legend.title=element_blank(),legend.text = element_text(size = 8), legend.position="right")+
  guides(color=guide_legend(ncol=1, byrow=FALSE))

plot_bar(ps.rawfilt, x="id_sample",fill="Linnaeus" ) +
  theme(legend.title=element_blank(),legend.text = element_text(size = 8), legend.position="right")+
  guides(color=guide_legend(ncol=1, byrow=FALSE))

Figure 14. Abundance at the genus level, before and after filtering; abundance at the species level, before and after filtering.

Did rare ASVs that were removed from this filtering step represent a large proportion of all ASVs in our samples?
> No, rare ASVs that were removed from this filtering step did not represent a large proportion of all ASVs in our samples. We can see that by comparing the plot for the filtered dataset with the first plot.

5.7 Alpha-diversity

# Question 1
ps.rawfilt.q1 = subset_samples(ps.rawfilt, time!="END")
ps.q3 = subset_samples(ps, time!="START")
ps.filt.q3 = subset_samples(ps.filt, time!="START")
ps.rawfilt.q3 = subset_samples(ps.rawfilt, time!="START")
pl.alpha <- plot_richness(ps.rawfilt, color = "inoculum", x="inoculum", measures=c("Observed","Shannon", "Simpson"))
pl.alpha.box <- pl.alpha + geom_boxplot()
plot(pl.alpha.box)

# Let's run a simple ANOVA on alpha diversity indices
ps.alpha <- estimate_richness(ps.rawfilt, measures=c("Observed","Shannon", "Simpson"))
ps.dataA <- cbind(sample_data(ps.rawfilt), ps.alpha)
ps.dataA$reads <- sample_sums(ps)
ps.alpha.aov <- aov(Observed ~ inoculum + reads + DNAconc, ps.dataA)
summary(ps.alpha.aov)
##             Df Sum Sq Mean Sq F value Pr(>F)  
## inoculum     2  26.08  13.042   2.748 0.0807 .
## reads        1  27.83  27.830   5.864 0.0219 *
## DNAconc      1   3.39   3.394   0.715 0.4046  
## Residuals   29 137.63   4.746                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
ps.alpha.aov <- aov(Shannon ~ inoculum + reads + DNAconc, ps.dataA)
summary(ps.alpha.aov)
##             Df Sum Sq Mean Sq F value Pr(>F)  
## inoculum     2 0.6903  0.3452   4.932 0.0143 *
## reads        1 0.0190  0.0190   0.271 0.6064  
## DNAconc      1 0.1750  0.1750   2.500 0.1247  
## Residuals   29 2.0293  0.0700                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
ps.alpha.aov <- aov(Simpson ~ inoculum + reads + DNAconc, ps.dataA)
summary(ps.alpha.aov)
##             Df  Sum Sq Mean Sq F value Pr(>F)  
## inoculum     2 0.07078 0.03539   3.751 0.0356 *
## reads        1 0.00193 0.00193   0.204 0.6547  
## DNAconc      1 0.02409 0.02409   2.553 0.1209  
## Residuals   29 0.27362 0.00944                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

How similar are the communities in the two kefir starter cultures?
> We can see that the two starters communities have similar richness (p = 0.0807) in the leftmost plot (“Observed”), but Shannon and Simpson are significantly different (p = 0.0143 and 0.0356, respectively). The Shannon value is a measure of evenness, weighing equally richness and abundance, and the Simpson measure is the probability of encounter, and thus a measure of dominance. So our cultures have similar richness but different evenness and dominance. In all three plots, the commmunity from Philipp looks very close to the start culture, whereas Vincent’s community is more different.

# Question 2
pl.alpha <- plot_richness(ps.rawfilt, color = "time", x="time", measures=c("Observed","Shannon", "Simpson"))
pl.alpha.box <- pl.alpha + geom_boxplot()
plot(pl.alpha.box)

# Let's run a simple ANOVA on alpha diversity indices
ps.alpha <- estimate_richness(ps.rawfilt, measures=c("Observed","Shannon", "Simpson"))
ps.dataA <- cbind(sample_data(ps.rawfilt), ps.alpha)
ps.dataA$reads <- sample_sums(ps)
ps.alpha.aov <- aov(Observed ~ time + reads + DNAconc, ps.dataA)
summary(ps.alpha.aov)
##             Df Sum Sq Mean Sq F value Pr(>F)  
## time         4  31.02    7.75   1.681 0.1834  
## reads        1  33.91   33.91   7.351 0.0115 *
## DNAconc      1   5.48    5.48   1.187 0.2855  
## Residuals   27 124.54    4.61                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
ps.alpha.aov <- aov(Shannon ~ time + reads + DNAconc, ps.dataA)
summary(ps.alpha.aov)
##             Df Sum Sq Mean Sq F value Pr(>F)  
## time         4 0.7282 0.18205   2.510 0.0653 .
## reads        1 0.0152 0.01518   0.209 0.6510  
## DNAconc      1 0.2118 0.21183   2.921 0.0989 .
## Residuals   27 1.9584 0.07253                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
ps.alpha.aov <- aov(Simpson ~ time + reads + DNAconc, ps.dataA)
summary(ps.alpha.aov)
##             Df  Sum Sq  Mean Sq F value Pr(>F)  
## time         4 0.07411 0.018527   1.892 0.1408  
## reads        1 0.00188 0.001877   0.192 0.6650  
## DNAconc      1 0.03005 0.030048   3.069 0.0912 .
## Residuals   27 0.26438 0.009792                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

How stable are the communities over time?
> Communities are stable over time, and we see that the start and T4 are very similar. Moreover, the p-value are not significant (p = 0.1834, 0.0653, 0.1408) between our time variable and our community composition.

# Question 3
pl.alpha <- plot_richness(ps.rawfilt.q3, color = "fruit", x="fruit", measures=c("Observed","Shannon", "Simpson"))
pl.alpha.box <- pl.alpha + geom_boxplot()
plot(pl.alpha.box)

# Let's run a simple ANOVA on richness (the other metrics are non-linear)
ps.alpha <- estimate_richness(ps.rawfilt.q3, measures=c("Observed","Shannon", "Simpson"))
ps.dataA <- cbind(sample_data(ps.rawfilt.q3), ps.alpha)
ps.dataA$reads <- sample_sums(ps.q3)
ps.alpha.aov <- aov(Observed ~ fruit + reads + DNAconc, ps.dataA)
summary(ps.alpha.aov)
##             Df Sum Sq Mean Sq F value Pr(>F)
## fruit        1   0.01    0.01   0.002  0.968
## reads        1   7.97    7.97   1.393  0.249
## DNAconc      1  14.67   14.67   2.564  0.121
## Residuals   26 148.72    5.72
ps.alpha.aov <- aov(Shannon ~ fruit + reads + DNAconc, ps.dataA)
summary(ps.alpha.aov)
##             Df Sum Sq Mean Sq F value Pr(>F)
## fruit        1 0.0237 0.02366   0.262  0.613
## reads        1 0.1455 0.14548   1.610  0.216
## DNAconc      1 0.0522 0.05220   0.578  0.454
## Residuals   26 2.3500 0.09039
ps.alpha.aov <- aov(Simpson ~ fruit + reads + DNAconc, ps.dataA)
summary(ps.alpha.aov)
##             Df  Sum Sq  Mean Sq F value Pr(>F)
## fruit        1 0.00356 0.003563   0.326  0.573
## reads        1 0.01298 0.012979   1.189  0.286
## DNAconc      1 0.00563 0.005632   0.516  0.479
## Residuals   26 0.28390 0.010919

Do they change in response to the environment?
> We can see that the fruit do not influence our diversity, and the p-values for “Observed”, “Shannon” and “Simpson” are not significant (p = 0.968, 0.613, 0.573).

pl.alpha <- plot_richness(ps.rawfilt, color = "treatment", x="treatment", measures=c("Observed","Shannon", "Simpson"))
pl.alpha.box <- pl.alpha + geom_boxplot()
plot(pl.alpha.box)

Figure 15. Three different measures of alpha diversity, colors represent the origin of the inoculum (first plot), time (second plot), fruit (third plot) and treatment (fourth plot).

# Let's run a simple ANOVA on richness (the other metrics are non-linear)
ps.alpha <- estimate_richness(ps.rawfilt, measures=c("Observed","Shannon", "Simpson"))
ps.dataA <- cbind(sample_data(ps.rawfilt), ps.alpha)
ps.dataA$reads <- sample_sums(ps)
ps.alpha.aov <- aov(Observed ~ treatment + reads + DNAconc, ps.dataA)
summary(ps.alpha.aov)
##             Df Sum Sq Mean Sq F value Pr(>F)  
## treatment    4  31.02    7.75   1.681 0.1834  
## reads        1  33.91   33.91   7.351 0.0115 *
## DNAconc      1   5.48    5.48   1.187 0.2855  
## Residuals   27 124.54    4.61                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

5.8 Composition

# Now let's do the same thing at the genus level, displaying only the 20 most abundant genera
# select top 10 most prevalent Families over the whole dataset
ps.gen <- tax_glom(ps.filt.q3, taxrank="Genus")
Top10_gen <- names(sort(taxa_sums(ps.gen), TRUE)[1:10])
ps.gen <- prune_taxa(Top10_gen, ps.gen)
# Melt them for plotting
ps.gen.df <- psmelt(ps.gen)

# Plot each treatment group next to each other
ps.gen.df %>%
  ggplot(aes_string(x = "Sample", y = "Abundance", fill = "Genus")) +
  geom_bar(stat = "identity", position = "stack") + 
  theme_bw() + 
  facet_grid(~ inoculum, scales = "free_x", space = "free_x") +
  theme(legend.position = "bottom", axis.text.x =  element_text(angle = 80, vjust = 0.5)) +
  ggtitle("Relative abundance at the Genus level between Philipp's and Vincent's kefir cultures")

ps.gen <- tax_glom(ps.filt, taxrank="Genus")
Top10_gen <- names(sort(taxa_sums(ps.gen), TRUE)[1:10])
ps.gen <- prune_taxa(Top10_gen, ps.gen)
# Melt them for plotting
ps.gen.df <- psmelt(ps.gen)

ps.gen.df %>%
  ggplot(aes_string(x = "Sample", y = "Abundance", fill = "Genus")) +
  geom_bar(stat = "identity", position = "stack") + 
  theme_bw() + 
  facet_grid(~ time, scales = "free_x", space = "free_x") +
  theme(legend.position = "bottom", axis.text.x =  element_text(angle = 80, vjust = 0.5)) +
  ggtitle("Relative abundance at the Genus level between start and all the time points")

ps.gen.q3 <- tax_glom(ps.filt.q3, taxrank="Genus")
Top10_gen <- names(sort(taxa_sums(ps.gen.q3), TRUE)[1:10])
ps.gen.q3 <- prune_taxa(Top10_gen, ps.gen.q3)
# Melt them for plotting
ps.gen.q3.df <- psmelt(ps.gen.q3)

ps.gen.q3.df %>%
  ggplot(aes_string(x = "Sample", y = "Abundance", fill = "Genus")) +
  geom_bar(stat = "identity", position = "stack") + 
  theme_bw() + 
  facet_grid(~ fruit, scales = "free_x", space = "free_x") +
  theme(legend.position = "bottom", axis.text.x =  element_text(angle = 80, vjust = 0.5)) +
  ggtitle("Relative abundance at the Genus level between figs and raisins")

Figure 16. Relative abundance for every sample, at the genus level, grouped by the origin of the culture (first plot), the different time points (second plot) and the food (third plot).

Do you see any pattern between comparison groups?
> No, we do not see any pattern, all abundance levels are similar. The different characteristics do not seem to have an influence: Philip vs Vincent, timepoints or food. In all cases, we find the same genus, Lactobacillus is largely dominant and we also find some Acetobacter, Gluconobacter and Staphylococcus.

How much influence do you think the starting culture had on the rest of the samples?
> All our samples come from the same starting culture, so it seems to be the most influential factor. However, we did not try with any other starter cultures so we can not do a comparison. As the communities appear not to be affected by the environment they were cultivated in and remain similar to the original culture, it is highly possible that the starter culture is the most influential factor.

# Plot the proportions to better compare the communities
pl.com.pro <- plot_bar(ps.filt, x="id_sample",fill="Linnaeus") +        
  facet_wrap(~treatment, scales="free_x", nrow=1) +
  theme(legend.title=element_blank(),legend.text = element_text(size = 8), legend.position="bottom")+
  guides(color=guide_legend(ncol=6, byrow=FALSE))
pl.com.pro

Figure 17. Relative abundance at the species level, grouped by treatment (origin of culture and food).

5.9 Beta-diversity

# Ordinate using Principal Coordinate Analysis & NMDS
ps.nmds.bray <- ordinate(ps.rawfilt, "NMDS", "bray",trymax=50)
# Check stress level and dimensions used
ps.nmds.bray
## 
## Call:
## metaMDS(comm = veganifyOTU(physeq), distance = distance, trymax = 50) 
## 
## global Multidimensional Scaling using monoMDS
## 
## Data:     wisconsin(sqrt(veganifyOTU(physeq))) 
## Distance: bray 
## 
## Dimensions: 2 
## Stress:     0.1310455 
## Stress type 1, weak ties
## No convergent solutions - best solution after 50 tries
## Scaling: centring, PC rotation, halfchange scaling 
## Species: expanded scores based on 'wisconsin(sqrt(veganifyOTU(physeq)))'
# Plot NMDS
pl.nmds.bray <- plot_ordination(ps.rawfilt, ps.nmds.bray, color="treatment",title="NDMS of Bray-Curtis dissimilarities",label="id_sample" ) +
  stat_ellipse(aes(group = inoculum))
pl.nmds.bray

Figure 18. Non-metric multidimensional scaling (NMDS) of all the samples based on Bray-Curtis dissimilarities, colored by treatment.

# Calculate Bray-Curtis distance
ps.bray <- phyloseq::distance(ps.rawfilt, "bray")

# Test with ADONIS
ps.adonis.int <- adonis2(ps.bray ~ reads + (inoculum*fruit), data = ps.dataA, perm =9999)
ps.adonis.int
## Permutation test for adonis under reduced model
## Terms added sequentially (first to last)
## Permutation: free
## Number of permutations: 9999
## 
## adonis2(formula = ps.bray ~ reads + (inoculum * fruit), data = ps.dataA, permutations = 9999)
##                Df SumOfSqs      R2      F Pr(>F)    
## reads           1  0.41510 0.15986 7.9239 0.0001 ***
## inoculum        2  0.62683 0.24140 5.9829 0.0001 ***
## fruit           1  0.04455 0.01716 0.8504 0.4860    
## inoculum:fruit  1  0.04335 0.01669 0.8275 0.4972    
## Residual       28  1.46678 0.56489                  
## Total          33  2.59661 1.00000                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

What seems to be the most influential factor explaining the differences between bacterial kefir communities?
> The most influential factor is the inoculum, because Philipp’s and Vincent’s starters are separated, but we find both fruits in both clusters. The p-value for inoculum is very significant (p = 0.0001), but the p-value for fruit is not significant (p = 0.4964).

6 Discussion

The analysis of 16S rRNA amplicon sequencing data revealed that the most abundant ASVs were assigned to the phyla Firmicutes. Furthermore, Proteobacteria and Actinobacteriota were also present in our samples. While 16S rRNA amplicon sequencing typically does not provide identification at the species or strain level, the taxonomic tool used in our study allowed us to achieve a finer resolution approaching the species level. As anticipated, a significant proportion of ASVs were assigned to Lactobacillus species (Firmicutes). These findings align with the species previously characterized as key members of the water kefir community. Despite the fermentation steps taking place in different households and non-sterile conditions, the composition and structure of the various communities remained relatively stable over time. When analyzing the influence of the four treatments on the water kefir community, it became evident that the microbial species diversity of the samples was primarily determined by the starter culture. In the metagenomic analysis, the majority of reads were assigned to the Lactobacillaceae family, consistent with the 16S analysis and existing literature.